Data analysis using ML models with OPTIMEO

Let’s create an experimental_data(temp, cA, cB, cC) function that simulates the yield of a chemical reaction based on temperature and the concentrations of three species A, B and C.

import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "notebook"

def experimental_data(temp, cA, cB, cC):
    """
    This function simulates experimental data based on temperature and concentrations.
    The function is not based on any real experimental data and is purely for demonstration purposes.
    """
    out = .2*temp + .5*temp*cA + (cA)/3 + (1 - cB)**2/2 + (3 - cC)/1.5 + np.random.normal(0, 0.2, len(temp))
    return out

def generate_data(N=100):
    temp = np.random.uniform(0, 100, N)
    cA = np.random.uniform(0, 1, N)
    cB = np.random.uniform(0, 1, N)
    cC = np.random.uniform(0, 1, N)
    exp_response = experimental_data(temp, cA, cB, cC)
    # Create a DataFrame with the generated data
    df = pd.DataFrame({'temp': temp, 
                       'cA': cA, 
                       'cB': cB, 
                       'cC': cC, 
                       'response': exp_response})
    return df

df = generate_data(50)
df.to_csv('dataML.csv', index=False)
df.head()
temp cA cB cC response
0 65.150929 0.856934 0.143145 0.202648 43.365583
1 70.857440 0.611890 0.588674 0.089296 37.581891
2 84.400019 0.827279 0.546844 0.081419 53.979769
3 32.449034 0.888718 0.713346 0.640800 23.005230
4 64.010182 0.757054 0.611216 0.881803 38.662838

Now, we will use the OPTIMEO package to analyse the data.

from optimeo.analysis import *

data = pd.read_csv('dataML.csv')
factors = data.columns[:-1]
response = data.columns[-1]
analysis = DataAnalysis(data, factors, response)
analysis
DataAnalysis(data=(50, 5), factors=Index(['temp', 'cA', 'cB', 'cC'], dtype='object'), response=response, model_type=None, split_size=0.2, encoders={})
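The split_size=0.2 field in the repr above means one fifth of the rows are reserved for validation. As an illustration of what such a hold-out split does (a sketch with scikit-learn's train_test_split, not OPTIMEO's internal code):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
data = pd.DataFrame({'temp': rng.uniform(0, 100, 50),
                     'response': rng.normal(size=50)})

# 80% of the rows train the model, 20% are held out to score it
train, test = train_test_split(data, test_size=0.2, random_state=0)
print(len(train), len(test))  # 40 10
```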

First, let’s look at a simple linear model:

analysis.compute_linear_model()
analysis.linear_model.summary()
OLS Regression Results
Dep. Variable: response R-squared: 0.926
Model: OLS Adj. R-squared: 0.919
Method: Least Squares F-statistic: 139.9
Date: Thu, 17 Apr 2025 Prob (F-statistic): 8.94e-25
Time: 09:43:34 Log-Likelihood: -144.72
No. Observations: 50 AIC: 299.4
Df Residuals: 45 BIC: 309.0
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -12.9717 2.653 -4.889 0.000 -18.316 -7.628
temp 0.4709 0.023 20.630 0.000 0.425 0.517
cA 25.9518 2.121 12.239 0.000 21.681 30.223
cB 1.6269 2.381 0.683 0.498 -3.168 6.422
cC -0.0613 2.581 -0.024 0.981 -5.260 5.137
Omnibus: 0.893 Durbin-Watson: 1.991
Prob(Omnibus): 0.640 Jarque-Bera (JB): 0.331
Skew: 0.153 Prob(JB): 0.848
Kurtosis: 3.256 Cond. No. 324.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

figs = analysis.plot_linear_model()
for fig in figs:
    fig.show()

The equation used for the fit is the following one; you can change it if you want, e.g. to add interaction terms or other polynomial terms:

analysis.write_equation()
'response ~ temp + cA + cB + cC '
analysis.equation = 'response ~ temp + temp:cA + cA + cB + cC'
analysis.compute_linear_model()
analysis.linear_model.summary()
OLS Regression Results
Dep. Variable: response R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 4.641e+04
Date: Thu, 17 Apr 2025 Prob (F-statistic): 1.09e-80
Time: 09:43:35 Log-Likelihood: 4.5898
No. Observations: 50 AIC: 2.820
Df Residuals: 44 BIC: 14.29
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 2.3612 0.179 13.201 0.000 2.001 2.722
temp 0.1984 0.002 83.340 0.000 0.194 0.203
temp:cA 0.5014 0.004 131.248 0.000 0.494 0.509
cA 0.1534 0.224 0.684 0.498 -0.299 0.606
cB -0.4569 0.123 -3.728 0.001 -0.704 -0.210
cC -0.2756 0.132 -2.092 0.042 -0.541 -0.010
Omnibus: 1.678 Durbin-Watson: 1.538
Prob(Omnibus): 0.432 Jarque-Bera (JB): 1.536
Skew: -0.309 Prob(JB): 0.464
Kurtosis: 2.405 Cond. No. 538.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

figs = analysis.plot_linear_model()
for fig in figs:
    fig.show()
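The equation string uses the patsy formula syntax familiar from statsmodels: `a:b` adds an interaction and `I(...)` wraps arithmetic such as squares. A self-contained sketch with statsmodels (reusing the simulator's formula to generate data, independently of OPTIMEO) shows those terms at work:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 50
df = pd.DataFrame({'temp': rng.uniform(0, 100, n),
                   'cA': rng.uniform(0, 1, n),
                   'cB': rng.uniform(0, 1, n),
                   'cC': rng.uniform(0, 1, n)})
df['response'] = (0.2 * df['temp'] + 0.5 * df['temp'] * df['cA'] + df['cA'] / 3
                  + (1 - df['cB'])**2 / 2 + (3 - df['cC']) / 1.5
                  + rng.normal(0, 0.2, n))

# temp:cA is the temp*cA interaction; I(cB**2) adds a quadratic term
model = smf.ols('response ~ temp + temp:cA + cA + cB + I(cB**2) + cC', data=df).fit()
print(round(model.rsquared, 3))
```

With the interaction term included, the fit recovers the simulator almost exactly, just like the R-squared of 1.000 in the summary above.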

Now let’s build an ML model to predict the yield from the temperature and the concentrations of A, B and C.

analysis.model_type = "ElasticNetCV"
# analysis.model_type = "RidgeCV"
# analysis.model_type = "LinearRegression"
# analysis.model_type = "RandomForest"
# analysis.model_type = "GaussianProcess"
# analysis.model_type = "GradientBoosting"
MLmodel = analysis.compute_ML_model()
figs = analysis.plot_ML_model()
for fig in figs:
    fig.show()
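For reference, the "ElasticNetCV" option presumably wraps scikit-learn's estimator of the same name. A standalone sketch with simulator-style data (again an assumption about the pipeline, not OPTIMEO's actual code):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 50
X = pd.DataFrame({'temp': rng.uniform(0, 100, n),
                  'cA': rng.uniform(0, 1, n),
                  'cB': rng.uniform(0, 1, n),
                  'cC': rng.uniform(0, 1, n)})
y = (0.2 * X['temp'] + 0.5 * X['temp'] * X['cA'] + X['cA'] / 3
     + (1 - X['cB'])**2 / 2 + (3 - X['cC']) / 1.5
     + rng.normal(0, 0.2, n))

# fit on 80% of the rows, report R^2 on the held-out 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = ElasticNetCV(cv=5).fit(X_train, y_train)
print(round(model.score(X_test, y_test), 3))
```

Without the temp:cA interaction among the features, the purely linear ElasticNet model can only approximate the simulator, so its test-set R² stays below the 1.000 of the interaction-aware OLS fit above.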

And if we want to make a prediction:

new_value = pd.DataFrame({'temp': [50], 
                          'cA': [0.35], 
                          'cB': [0.5], 
                          'cC': [0.5]})
analysis.predict(new_value)
prediction model
0 20.582657 ElasticNetCV
1 20.745541 Linear Model
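As a sanity check, we can evaluate the noiseless part of the experimental_data simulator at the same point; both predictions land close to it:

```python
# noiseless simulator value at temp=50, cA=0.35, cB=0.5, cC=0.5
temp, cA, cB, cC = 50, 0.35, 0.5, 0.5
truth = 0.2 * temp + 0.5 * temp * cA + cA / 3 + (1 - cB)**2 / 2 + (3 - cC) / 1.5
print(round(truth, 3))  # 20.658
```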